
Corporation Websites Collection

This collection contains a web archive corpus of more than 0.8 million corporate websites (from an original list of ~0.98 million websites), extracted from the archive.org web archive and covering the period 1996 to early 2017. This corpus was originally created as a collaboration between the Internet Archive and a group at Dartmouth University, but it may be useful to other researchers.

Updated or more detailed information may exist at:

Corpus Statistics

  • approximately 840,000 domains
  • more than 500,000,000 unique URLs
  • more than 1,600 WARC files and 26,000 ARC files
  • more than 3.25 TB compressed

Content

This dataset contains a sample of up to 500 unique text/html URLs per year for each website. The sample records were selected based on the sort order of the URL strings, then extracted and packaged into (W)ARC files. The breakdown of the number of captures per year per domain is available here:
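The sampling rule described above can be illustrated with a minimal sketch. The exact procedure used to build the corpus is not documented here beyond "sort order of the URL strings", so this assumes plain lexicographic ordering and keeps the first entries per domain and year. The function name `sample_urls` and the `(domain, year, url)` input shape are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def sample_urls(captures, limit=500):
    """Keep up to `limit` unique URLs per (domain, year), in sorted URL order.

    `captures` is an iterable of (domain, year, url) tuples. This input
    shape is an assumption, not the corpus's actual index format.
    """
    unique = defaultdict(set)
    for domain, year, url in captures:
        # A set drops duplicate captures of the same URL within a year.
        unique[(domain, year)].add(url)
    # Assumption: "sort order" means plain lexicographic string order.
    return {key: sorted(urls)[:limit] for key, urls in unique.items()}


if __name__ == "__main__":
    demo = [
        ("example.com", 2001, "http://example.com/b.html"),
        ("example.com", 2001, "http://example.com/a.html"),
        ("example.com", 2001, "http://example.com/a.html"),  # duplicate
        ("example.com", 2002, "http://example.com/"),
    ]
    for key, urls in sorted(sample_urls(demo, limit=2).items()):
        print(key, urls)
```

One consequence of selecting by sort order rather than at random is that, for sites with more than 500 URLs in a year, paths that sort early are over-represented in the sample.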

How to Download

This large corpus is split into several hundred distinct "items" on archive.org.

We recommend using the internetarchive Python utility (also known as "ia") for bulk downloads. See also:
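As a rough sketch of what a bulk download might look like, the snippet below assembles an `ia download` command that selects every item in a collection via `--search` and restricts the files with `--glob`. The collection identifier `EXAMPLE_COLLECTION_ID` is a placeholder, not the real identifier of this collection, and the `*.warc.gz` pattern is an assumption about file naming. The default `--dry-run` only prints URLs, which is a cheap way to check the selection before committing to several terabytes.

```python
import subprocess


def build_ia_download_command(collection_id, glob_pattern="*.warc.gz", dry_run=True):
    """Build an `ia download` command line for every item in a collection.

    `collection_id` must be supplied by the caller; the value used in the
    demo below is a placeholder, not this collection's real identifier.
    """
    cmd = [
        "ia",
        "download",
        f"--search=collection:{collection_id}",  # select items by search query
        f"--glob={glob_pattern}",                # only fetch matching files
    ]
    if dry_run:
        cmd.append("--dry-run")  # print URLs instead of downloading
    return cmd


def run_download(cmd):
    """Run the command; requires the `ia` tool to be installed and on PATH."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_download(...) to actually execute it.
    print(" ".join(build_ia_download_command("EXAMPLE_COLLECTION_ID")))
```

Since the corpus is split across several hundred items, selecting by collection avoids listing each item identifier by hand.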

WARC and ARC downloads can be verified using the unified manifest files at:
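A minimal sketch of such a verification step follows. The manifest layout is not specified here, so the code assumes a plain-text file with one `<sha1-hex> <filename>` pair per line; if the actual manifests use another hash or column order, `parse_manifest` and the hash constructor would need adjusting. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse lines of '<digest> <filename>' into a {filename: digest} dict.

    The line format is an assumption about the manifest files.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, name = line.split(None, 1)
        entries[name.strip()] = digest.lower()
    return entries


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte (W)ARC files fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(manifest_path, data_dir):
    """Return {filename: problem} for files that are missing or do not match."""
    entries = parse_manifest(Path(manifest_path).read_text())
    problems = {}
    for name, expected in entries.items():
        path = Path(data_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif sha1_of_file(path) != expected:
            problems[name] = "checksum mismatch"
    return problems
```

An empty result from `verify_downloads` means every file listed in the manifest is present and matches its recorded digest.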

Additional manifest and derived information (including CDX files) exist at:

